---
title: "Beijing Housing and Rental Price Analysis"
author: "Luwei Lei"
output:
  flexdashboard::flex_dashboard:
    source_code: embed
---
About
=====================================
Row {data-width=600}
-------------------------------------
### About
```{r}
```
#### Objective
Beijing is one of the most expensive cities in the world in which to own or rent a home. The average property price in Beijing is \$553 per square foot (CEIC, 2020), while the average property price in Washington, DC is \$515 per square foot (Clifford, 2018). The prices seem close, but the picture changes once we look at the price-to-income ratio, which is the number of years a person earning an average salary needs to work to afford an average property. Beijing's ratio is as high as 41.75, while the ratio for Washington, DC is only 4.92, according to NUMBEO, the world's largest cost-of-living database (NUMBEO, 2020). This visualization project therefore aims to present the factors that may contribute to the unusually high property prices in Beijing and to lay a foundation for further interdisciplinary analysis.
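As a rough illustration of the price-to-income ratio described above, the sketch below divides an average property price by an average annual income. The function name `price_to_income` and all numbers are hypothetical; they are not taken from NUMBEO or from the datasets, and NUMBEO's own methodology may differ in detail.

```r
# Hypothetical helper: years of average income needed to buy an average home.
price_to_income <- function(avg_price, avg_annual_income) {
  stopifnot(avg_annual_income > 0)
  avg_price / avg_annual_income
}

# Illustrative numbers only (not taken from NUMBEO or the datasets).
ratio <- price_to_income(avg_price = 500000, avg_annual_income = 25000)
ratio
```

Read this way, the ratio is a number of years rather than a percentage, which is why two cities with similar prices per square foot can still differ widely in affordability.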
#### Dataset
Two datasets were collected from Kaggle.com. The first was originally fetched from Lianjia.com, one of the best-known real-estate brokerage websites in Beijing. It contains Beijing housing prices from 2012 to 2017 along with property attributes such as location, district, days on the market, number of bedrooms, floor area, and whether the property is near a subway station. This allows us to examine the relationship between housing prices and property attributes, as well as the price trend in recent years. The second dataset covers Airbnb rental activity in Beijing from 2008 to 2019 and is used to complement the analysis of the first dataset. We subset it to the period from 2012 to 2017 so that it can be compared with the previous analysis. Besides basic listing information such as prices, locations, and house conditions, it also contains customers' textual reviews and listing descriptions, which help us understand the aspects that customers value most.
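The date subsetting mentioned above can be sketched as follows. This is a minimal example on a toy data frame, assuming a date column named `last_review`; the dashboard's own filter, which appears later in the source, may use different columns and bounds.

```r
# Toy stand-in for the Airbnb listings; column names are assumptions.
listings <- data.frame(
  id = 1:4,
  last_review = as.Date(c("2011-06-30", "2012-01-01", "2015-08-15", "2018-03-01"))
)

# Keep rows whose date falls inside the Lianjia window, 2012 through 2017.
in_window <- listings$last_review >= as.Date("2012-01-01") &
  listings$last_review <= as.Date("2017-12-31")
listings_2012_2017 <- listings[in_window, ]
listings_2012_2017$id
```

Comparing against `Date` values rather than character strings keeps the bounds explicit and avoids relying on implicit coercion.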
#### Source
Dataset 1: Housing price in Beijing (https://www.kaggle.com/ruiqurm/lianjia)
Dataset 2: Airbnb Beijing 20190211 (https://www.kaggle.com/merryyundi/airbnb-beijing-20190211)
#### References
CEIC. (2020, October 1). China: Property Price: YTD Avg: Beijing: Economic Indicators: CEIC. Retrieved December 11, 2020, from https://www.ceicdata.com/en/china/nbs-property-price-monthly/property-price-ytd-avg-beijing
NUMBEO. (2020). Property Prices. Retrieved October 22, 2020, from https://www.numbeo.com/property-investment/rankings.jsp
Clifford, C. (2018, August 11). Manhattan real estate is the most expensive in the US per square foot with some properties topping $10,000: Study. Retrieved December 11, 2020, from https://www.cnbc.com/2018/08/11/manhattan-real-estate-is-the-most-expensive-in-the-us-per-square-foo.html
Row {data-width=400}
-------------------------------------
### Beijing City
```{r picture, echo = F, fig.cap = "Image retrieved from Booking.com"}
knitr::include_graphics("img/beijing.png")
```
### Required R and Python packages
```{r}
```
R 3.6.1
igraph 1.2.5 tidyverse 1.3.0 plyr 1.8.6
stringr 1.4.0 ggplot2 3.3.0 zoo 1.8.7
ggridges 0.5.2 viridis 0.5.1 hrbrthemes 0.8.0
survminer 0.4.8 leaflet 2.0.3 survival 3.1.11
wordcloud2 0.2.1 htmlwidgets 1.5.2
Python 3.7.4
pandas 1.1.4 plotly 4.6.0 jieba 0.42.1
numpy 1.17.3
Price Trend {data-orientation=rows}
=====================================
Row {data-height=500}
-------------------------------------
### Summary of price trend
Three graphs show the average property price trend in Beijing. The graph on the right shows the monthly average price in each district from 2012 to 2017. Beijing has sixteen districts, but because of data limitations we only have information on thirteen of them. Overall, prices rose over this period. The average price per square meter declined slightly in mid-2014 but increased rapidly from 2015 to 2017. It reached its peak in early 2017 and then cooled down until the end of the year. According to Global Property Guide (2020), the price decrease in 2017 followed the tighter government measures implemented in late 2016. Among all the districts, Xicheng, Dongcheng, Haidian and Chaoyang have relatively high prices compared with the others.
The graph below is the histogram of the number of houses sold each month from 2012 to 2017, colored by average price range. Comparing this graph with the price trend, it is not hard to see that even though prices were rising rapidly in 2015, 2016 and early 2017, people were buying more houses. When prices dropped a little in late 2017, however, fewer houses were sold.
The last graph shows the survival rate of houses on the market, that is, of listings on the Lianjia.com website. The survival probability is the proportion of units that survive beyond a specified time; here, a house survives for as long as it remains listed. Houses with fewer followers have a lower survival rate, which makes sense because a house becomes unavailable once someone buys it and so has less time to attract followers. This also indicates that these deals were made quickly after the house was listed.
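To make the definition of survival probability above concrete, here is a minimal base-R sketch on made-up days-on-market values. The vector `dom` and the helper `survival_at` are hypothetical. The sketch assumes every listing eventually sold (no censoring), in which case the Kaplan-Meier estimate reduces to this simple proportion.

```r
# Hypothetical days-on-market values; every listing is assumed to have sold,
# so there is no censoring in this toy example.
dom <- c(1, 3, 3, 7, 10, 30, 45, 90)

# Empirical survival function: share of listings still unsold after day t.
survival_at <- function(t, days) mean(days > t)

# Evaluate the curve at a few time points.
s <- sapply(c(0, 3, 10, 60), survival_at, days = dom)
s
```

A curve that drops steeply near day zero corresponds to listings that sell soon after being posted.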
Delmendo, L. C. (2020, March 28). Coronavirus in China. Retrieved December 11, 2020, from https://www.globalpropertyguide.com/Asia/China/Price-History
### Price Trend of Average Beijing housing price by District
```{r message=FALSE, warning=FALSE, echo = FALSE}
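# plyr is attached after tidyverse on purpose: its rename() masks dplyr's,
# and the code below relies on the plyr form rename(x, c(old = new)).
library(tidyverse)
library(plyr)
library(stringr)
library(ggplot2)
library(zoo)
library(ggridges)
library(viridis)
library(hrbrthemes)
library(leaflet)
library(survminer)
library(survival)
library(wordcloud2)
library(htmlwidgets)
library(igraph)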
```
```{r message=FALSE, warning=FALSE, echo = FALSE}
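data = read_csv('data/housing.csv')
data = data %>% select(-c(url, id))
# plyr::rename() takes c(old = new). The renames below treat the raw
# livingRoom column as the bedroom count and drawingRoom as the living-room count.
data <- plyr::rename(data, c("livingRoom" = "bedRoom"))
data <- plyr::rename(data, c("drawingRoom" = "livingRoom"))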
```
```{r message=FALSE, warning=FALSE, echo = FALSE}
# data = drop_na(data)
data$district = as.factor(data$district)
data$district = revalue(data$district,
                        c("1" = "Dongcheng", "2" = "Fengtai", "3" = "Yizhuang", "4" = "Daxing",
                          "5" = "Fangshan", "6" = "Changping", "7" = "Chaoyang", "8" = "Haidian",
                          "9" = "ShijingShan", "10" = "Xicheng", "11" = "Tongzhou",
                          "12" = "Mentougou", "13" = "Shunyi"))
```
```{r message=FALSE, warning=FALSE, echo = FALSE}
data = data[data$tradeTime > "2012-01-01", ]
data$year = format(as.Date(data$tradeTime, format="%Y-%m-%d"),"%Y")
# month = format(as.Date(data$tradeTime, format="%Y-%m-%d"),"%Y-%m")
data$month = format(as.Date(data$tradeTime, format="%Y-%m-%d"),"%Y-%m")
data$floor = as.numeric(str_split_fixed(data$floor, " ", 2)[,2])
```
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.width= 10, fig.cap = "Time Series of Housing Price, data retrieved from kaggle.com"}
df = data[, c("month", "price", "district")]
temp = aggregate( price ~ month + district, df, mean )
# df = data[, c("tradeTime", "price")]
# temp = aggregate( price ~ tradeTime, df, mean )
temp$month <- as.yearmon(temp$month , "%Y-%m")
temp$district = as.character(temp$district)
temp <-temp[order(temp$district),]
p <- ggplot(temp, aes(x=month, y=price )) +
  geom_line(aes(colour=district)) +
  scale_color_manual(values = c("#000000","#004949","#009292","#ff6db6","#ffb6db",
                                "#490092","#006ddb","#b66dff","#6db6ff","#b6dbff",
                                "#920000","#924900","#db6d00","#24ff24","#ffff6d"))+
  theme_ipsum() +
  labs(title = expression(paste("Beijing Average Housing Price (RMB/", m^{2},') Trend 2012-2017 by District')), x = "Year", y = "Average Price per Square Meter (RMB)")
# ggsave('img/timeseries.png', p, width = 300, units = "mm")
p
```
Row {data-height=500}
-------------------------------------
### Number of Houses Sold Every Year vs. Average Housing Price
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.width= 10, fig.cap = "Bar plot of number of houses sold by average price range, data retrieved from kaggle.com"}
htmltools::includeHTML("img/housesbyYear.html")
```
### Survival Rate of Houses on the Market
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.width= 10, fig.cap = "Line graph of survival rate of houses, data retrieved from kaggle.com"}
temp = data[, c("price", "DOM", "district", "followers")]
temp$followers <- cut(temp$followers, breaks=c(0, 2, 10, 20, 30, 1200))
temp = drop_na(temp)
fit <- survfit(Surv(DOM) ~ followers, data = temp)
p = ggsurvplot(fit, data = temp, ggtheme = theme_bw(),
               palette = c("#AC3931", "#537D8D", "#87C38F", "#F4F0BB", "#F6AE2D"),
               xlim = c(0, 800),
               legend.title = "Strata",
               legend = "right",
               xlab = "Days in the Market",
               title = "Survival Rate of houses by number of followers" )
p
```
House Conditions vs. Price {data-orientation=rows}
=====================================
Row
-------------------------------------
### Subway Accessibility vs. Average Housing Price
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.width= 10, fig.cap = "Density plot of price by subway, data retrieved from kaggle.com"}
temp = data[, c("subway", "price")]
temp = drop_na(temp)
temp$subway = as.character(temp$subway)
temp$subway[temp$subway == "0"] <- "No"
temp$subway[temp$subway == "1"] <- "Yes"
mu2 <- ddply(temp, "subway", summarise, grp.mean=mean(price))
p = ggplot(data = temp, aes(x = price, color = subway, fill = subway)) +
  geom_histogram(aes(y=..density..), position="identity", alpha = 0.3) +
  geom_density(alpha=0.3)+
  scale_fill_manual(values = c( "#AC3931", "#537D8D", "#CC79A7"))+
  scale_color_manual(values = c("#AC3931", "#537D8D", "#CC79A7"))+
  geom_vline(data = mu2, aes(xintercept=grp.mean, color=subway), alpha = 0.4, linetype="dashed")+
  # theme_ipsum() is a complete theme, so apply it first and adjust it afterwards
  theme_ipsum() +
  theme(text = element_text(size=12), plot.title = element_text( size=13))+
  labs(title = "Density Plot of Average Housing Price by Subway Access", x = expression(paste("Price (RMB/", m^{2},')')))
p
```
### Elevator Usability vs. Average Housing Price
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.width= 10, fig.cap = "Density plot of price by elevator, data retrieved from kaggle.com"}
temp = data[, c("elevator", "price")]
temp = drop_na(temp)
temp$elevator = as.character(temp$elevator)
temp$elevator[temp$elevator == "0"] <- "No"
temp$elevator[temp$elevator == "1"] <- "Yes"
mu2 <- ddply(temp, "elevator", summarise, grp.mean=mean(price))
p = ggplot(data = temp, aes(x = price, color = elevator, fill = elevator)) +
  geom_histogram(aes(y=..density..), position="identity", alpha = 0.3) +
  geom_density(alpha=0.3)+
  scale_fill_manual(values = c( "#AC3931", "#537D8D", "#CC79A7"))+
  scale_color_manual(values = c("#AC3931", "#537D8D", "#CC79A7"))+
  geom_vline(data = mu2, aes(xintercept=grp.mean, color=elevator), alpha = 0.4, linetype="dashed")+
  # theme_ipsum() is a complete theme, so apply it first and adjust it afterwards
  theme_ipsum() +
  theme(text = element_text(size=12), plot.title = element_text( size=13))+
  labs(title = "Density Plot of Average Housing Price by Elevator Access", x = expression(paste("Price (RMB/", m^{2},')')))
p
```
Row
-------------------------------------
### Housing Age vs. Average Housing Price
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.width= 8, fig.cap = "Density plot of price by house age, data retrieved from kaggle.com"}
temp1 = data[, c("constructionTime", "price", "year")]
temp1 = drop_na(temp1)
temp1$HouseAge = as.numeric(temp1$year) - as.numeric(temp1$constructionTime)
temp1 = temp1[(temp1$HouseAge >= 0) & (temp1$HouseAge < 107), ]
# include.lowest = TRUE keeps houses sold in their construction year (age 0) in the first bin
temp1$HouseAge <- cut(temp1$HouseAge, breaks=c(0,10,20,30,40,60,106), include.lowest = TRUE, labels=c("Less than 10 years","10-20 years","20-30 years", "30-40 years", "40-60 years", "More than 60 years"))
temp1 = drop_na(temp1)
p = ggplot(temp1, aes(x = price, y = HouseAge, fill = ..x..)) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis(name = "Price", option = "C") +
  labs(title = expression(paste("Average House Price (RMB/", m^{2},') by House Age')) ) +
  theme_ipsum() +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 10)
  )
p
```
### Summary of housing conditions
```{r}
```
This section uses three plots to compare house conditions with price. The top two graphs are density plots of average house price by subway accessibility and elevator availability. The results are not very surprising: houses that are near a subway or have an elevator generally have higher average prices.
The plot on the left is a ridgeline chart showing the distribution of average price for different house age groups. Interestingly, older houses have higher prices than newer ones. After manually examining some of the records, we found that the old houses are mostly Siheyuan, a historical type of residence located in the center of Beijing.
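The house age groups used in the ridgeline chart can be illustrated with a small sketch. The trade and construction years below are made up, and `include.lowest = TRUE` is an assumption made here so that a house sold in its construction year lands in the first group.

```r
# Hypothetical trade and construction years for four houses.
trade_year <- c(2016, 2017, 2015, 2017)
construction_year <- c(2016, 2005, 1990, 1950)
house_age <- trade_year - construction_year

# Bin the ages into labelled groups.
age_group <- cut(
  house_age,
  breaks = c(0, 10, 20, 30, 40, 60, 106),
  labels = c("Less than 10 years", "10-20 years", "20-30 years",
             "30-40 years", "40-60 years", "More than 60 years"),
  include.lowest = TRUE  # keep age 0 in the first bin
)
as.character(age_group)
```

Because `cut()` uses right-closed intervals by default, a house aged exactly 10 years falls in the first group and one aged exactly 20 in the second.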
House Locations
=====================================
Row {data-width=400}
-------------------------------------
### Summary of house locations vs. price
```{r}
```
We plotted the house locations on a map and highlighted the expensive houses. Expensive houses were defined as those in the top 15% of average prices in the data. The points were randomly sampled because it is impractical to plot the entire original dataset. We also mapped the average price per night of Airbnb listings. The results are quite different, as the two maps show. On the top right map, the expensive houses are concentrated in the center of the city, where the Xicheng, Dongcheng, Haidian and Chaoyang districts are located.
On the bottom right map of the Airbnb data, the expensive houses are more scattered across rural areas. We extracted some records and discovered that these houses are close to famous sights such as the Great Wall, which is located in northern Beijing.
The bottom left graph is a bar chart of the number of houses sold in each district, colored by price range. Chaoyang has more houses than Haidian, Dongcheng and Xicheng.
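The "top 15%" definition above can be expressed as a quantile cut-off. The sketch below uses invented prices and base R's `quantile()`. Note that the map code in this dashboard applies a fixed price threshold instead, so this is one possible reading of the definition rather than the code that produced the maps.

```r
# Hypothetical prices per square meter (RMB) for ten houses.
price <- c(20000, 35000, 42000, 48000, 51000, 56000, 63000, 70000, 90000, 150000)

# The 85th percentile separates the top 15% from the rest.
threshold <- quantile(price, probs = 0.85, names = FALSE)

# Flag houses priced above the cut-off.
is_expensive <- price > threshold
price[is_expensive]
```

A quantile-based threshold adapts to whatever sample is drawn, whereas a fixed threshold keeps the highlighted set comparable across samples.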
### Number of houses sold in different district by price range
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.width= 10}
temp1 = data[, c("district", "price")]
temp1 = drop_na(temp1)
colnames(temp1)[1] = "District"
temp1$price <- cut(temp1$price, breaks=c(0,20000,30000,50000,100000,160000), labels=c("Less than ¥20k","¥20k-30k","¥30k-50k", "¥50k-100k", "More than ¥100k"))
temp1 = drop_na(temp1)
temp1 = aggregate(temp1, by=list(temp1$District, temp1$price), FUN=length)
temp1 = subset(temp1, select = - c(price))
colnames(temp1) = c("District", "Price", "Count")
temp1$District = as.character(temp1$District)
temp1 <-temp1[order(temp1$District),]
g = ggplot(temp1, aes(fill=Price, y=Count, x=District)) +
  geom_bar(position="stack", stat="identity") +
  # a single manual fill scale; a second fill scale would only be overridden
  scale_fill_manual(values= c("#AC3931", "#537D8D", "#87C38F", "#F4F0BB", "#F6AE2D")) +
  ggtitle(expression(paste("Distribution of Average House Price (RMB/", m^{2},') by District'))) +
  theme_ipsum() +
  xlab("District") +
  theme(axis.text.x = element_text(angle = 45))
g
```
Row {data-width=600}
-------------------------------------
### Random Sampled House locations on map
```{r }
sample_data = drop_na(data)
sample_data = sample_data[sample(nrow(sample_data), 5000), ]
expensive_housing = sample_data %>% filter(price > 100000)
# part_expensive = expensive_housing[sample(nrow(expensive_housing), 1000), ]
leaflet(data = sample_data) %>%
  addProviderTiles('CartoDB') %>%
  # addProviderTiles('OpenStreetMap.DE') %>%
  addCircleMarkers(lat = ~Lat, lng = ~Lng, radius = 2, opacity = 0.3, fillOpacity = 0.3, group = "Randomly Sampled Houses in Beijing", color = '#3F8EFC') %>%
  addCircleMarkers(data = expensive_housing, lng = ~Lng, lat = ~Lat, color = '#DD6E42', opacity = 0.5, fillOpacity = 0.5, radius =2, group = "Expensive Housing") %>%
  # addMarkers(data = good_school, lng = ~Longitude, lat = ~Latitude, popup = ~学校名称, group = "Good Elementary Schools") %>%
  addLayersControl(
    overlayGroups = c("Expensive Housing", "Randomly Sampled Houses in Beijing"), options = layersControlOptions(collapsed = FALSE))
```
### Random Sampled Airbnb locations on map
```{r }
list = read_csv('data/airbnb/listings.csv')
list = list %>% select(id, summary, description, neighborhood_overview, notes, transit, house_rules, neighbourhood_cleansed, latitude, longitude, property_type, room_type, price, weekly_price, monthly_price, host_since, review_scores_rating, last_review)
list = list[list$host_since > '2012-01-01',]
list = list[list$last_review < '2018-01-01',]
# parse_number() replaces the deprecated tidyr::extract_numeric()
list$price = readr::parse_number(list$price)
list$neighbourhood_cleansed <- mapvalues(list$neighbourhood_cleansed,
from=c("海淀区","朝阳区 / Chaoyang","东城区", "石景山区", "顺义区 / Shunyi", "通州区 / Tongzhou", "丰台区 / Fengtai","怀柔区 / Huairou", "西城区", "昌平区", "大兴区 / Daxing", "延庆县 / Yanqing", "密云县 / Miyun","房山区","平谷区 / Pinggu", "门头沟区 / Mentougou"),
to=c("Haidian","Chaoyang","Dongcheng","ShijingShan", "Shunyi","Tongzhou", "Fengtai", "Huairou", "Xicheng", "Changping", "Daxing", "Yanqing", "Miyun", "Fangshan","Pinggu", "Mentougou"))
list = list[(list$neighbourhood_cleansed !="Pinggu") & (list$neighbourhood_cleansed !="Yanqing")& (list$neighbourhood_cleansed !="Miyun")& (list$neighbourhood_cleansed !="Huairou"), ]
```
```{r }
sample_data = list %>% select(latitude, longitude, neighbourhood_cleansed, price)
sample_data = drop_na(sample_data)
expensive_housing = sample_data %>% filter(price > 1000)
# part_expensive = expensive_housing[sample(nrow(expensive_housing), 1000), ]
leaflet(data = sample_data) %>%
  addProviderTiles('CartoDB') %>%
  addCircleMarkers(lat = ~latitude, lng = ~longitude, radius = 2, opacity = 0.5, fillOpacity = 0.5, group = "Randomly Sampled Houses in Beijing", color = '#3F8EFC') %>%
  addCircleMarkers(data = expensive_housing, lng = ~longitude, lat = ~latitude, color = '#DD6E42', opacity = 0.5, fillOpacity = 0.5, radius =2, group = "Expensive Housing") %>%
  addLayersControl(
    overlayGroups = c("Expensive Housing", "Randomly Sampled Houses in Beijing"), options = layersControlOptions(collapsed = FALSE))
# saveWidget(p, "airbnbPrice.html")
# system('mv airbnbPrice.html img/airbnbPrice.html')
```
Airbnb Insights
=====================================
Row {data-width=500}
-------------------------------------
### Summary of Airbnb House Price
```{r}
```
The Airbnb dataset also offers some interesting insights. The bottom left graph shows the number of listings in each district, colored by price range. As in the Lianjia housing data, Chaoyang has the highest number of houses in Beijing, followed by Haidian and Dongcheng, but Xicheng has far fewer houses available for rent than it has in the Lianjia data.
The bottom right is a wordcloud of customer reviews, which gives a general sense of the aspects customers value most. We parsed the reviews, extracted keywords, and built the wordcloud from them. The largest word, ‘胡同’, means Hutong, a famous type of alley or narrow street lined with historical architecture. Hutongs are emblematic of Beijing and are where Siheyuan are usually located, as introduced in the previous graphs. Other factors such as cleanliness, airline, pharmacy, elevator, nature and community shops are also things customers consider when choosing a house. Note: if the wordcloud does not load, please refresh the page.
The top right is a network graph in which circles represent reviewers and squares represent listing IDs. It is a sample graph of one customer and the houses that customer has rented. The same plot could be drawn for more customers and houses, but we chose a single customer to keep the graph clear.
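The keyword counting behind the wordcloud can be sketched in base R. The example assumes the reviews have already been split into keywords (the tokens below are invented, romanized stand-ins) and that the wordcloud input is a two-column table of `word` and `count`; the count bounds are illustrative, not the ones used in the dashboard.

```r
# Hypothetical keywords already extracted from reviews.
tokens <- c("hutong", "clean", "hutong", "elevator", "hutong", "clean")

# Count occurrences and sort from most to least frequent.
freq <- as.data.frame(table(word = tokens),
                      responseName = "count",
                      stringsAsFactors = FALSE)
freq <- freq[order(-freq$count), ]

# Drop very rare and very common words before plotting.
kept <- freq[freq$count > 1 & freq$count < 3, ]
freq
```

A table in this shape could then be passed to `wordcloud2(data = freq)`, which draws more frequent words in a larger size.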
### Number of Airbnb houses in different district by price range
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.width= 10, fig.cap = "Histogram of Airbnb house prices by District, data retrieved from kaggle.com"}
df = list %>% select(neighbourhood_cleansed, price)
# df = df[df$price < 3000, ]
colnames(df)[1] = "District"
df$price <- cut(df$price, breaks=c(0,200,300,500,600,70006), labels=c("Less than $200","$200-300","$300-500", "$500-600", "More than $600"))
df = drop_na(df)
df = aggregate(df, by=list(df$District, df$price), FUN=length)
df = subset(df, select = - c(price))
colnames(df) = c("District", "Price", "Count")
df <-df[order(df$District),]
ggplot(df, aes(fill=Price, y=Count, x=District)) +
  geom_bar(position="stack", stat="identity") +
  # a single manual fill scale; a second fill scale would only be overridden
  scale_fill_manual(values= c("#AC3931", "#537D8D", "#87C38F", "#F4F0BB", "#F6AE2D")) +
  ggtitle("The Airbnb house prices($/night) distribution by District ") +
  theme_ipsum() +
  xlab("District")+
  theme(axis.text.x = element_text(angle = 45))
```
Row {data-width=500}
-------------------------------------
### Network
```{r message=FALSE, warning=FALSE, echo = FALSE, fig.cap = "Network graph, data retrieved from kaggle.com"}
reviews = read_csv('data/airbnb/reviews.csv')
reviews = reviews %>% filter(reviewer_id == 1)
reviews = subset(reviews, select = c('listing_id', 'reviewer_id', 'reviewer_name'))
reviews$listing_id = round(reviews$listing_id / 1000)
links=data.frame(
source= c(reviews$reviewer_name, 7121),
target= c(reviews$listing_id, 'others')
)
# Turn it into igraph object
network <- graph_from_data_frame(d=links, directed=F)
# Count the number of degree for each node:
deg <- degree(network, mode="all")
# Plot: the first and last vertices are drawn as circles, the listing ids as squares
plot(network,
     vertex.size = c(70, 50, 50, 50, 50, 50, 50, 50, 50),
     edge.width = 5,
     vertex.shape = c("circle", "square", "square", "square", "square", "square", "square", "square", "circle"),
     vertex.color = rgb(0.1, 0.7, 0.8, 0.5))
legend("bottomleft", legend = c("House id", "Reviewer name"), bty = "n", pch = c(22, 21), pt.bg = c(rgb(0.1,0.7,0.8,0.5), rgb(0.1,0.7,0.8,0.5)), pt.cex = 1.5, horiz = FALSE, inset = c(0.1, 0.1))
```
### Wordcloud
```{r}
# htmltools::includeHTML("img/wordcloud.html")
worddf = read_csv('data/cleaned_text.csv')
worddf = subset(worddf, select = -c(X1) )
worddf = worddf[(worddf$count < 1800) & (worddf$count > 100), ]
a = wordcloud2(data=worddf, size=1.6, color = "random-dark")
a
```